Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Authors

Peter Anderson

Xiaodong He

Chris Buehler

Damien Teney

Mark Johnson

Stephen Gould

Lei Zhang

Abstract

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
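The split described in the abstract, where a bottom-up stage supplies one feature vector per region and a top-down stage assigns a weight to each, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function names `softmax` and `attend`, the toy feature values, and the dot-product scoring are all assumptions made for the example; the abstract does not say how the weightings are computed, and a real system would take its region features from a detector such as Faster R-CNN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, query):
    """Weight region features by their relevance to a top-down query.

    region_features: list of k feature vectors, one per proposed region
                     (stand-in for the bottom-up stage's output).
    query:           context vector for the task (the top-down signal).

    Dot-product scoring is an illustrative assumption, not something
    stated in the abstract.
    """
    # One relevance score per region.
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_features]
    # Normalise the scores into weights that sum to 1.
    weights = softmax(scores)
    # Weighted sum of the region features, dimension by dimension.
    dim = len(region_features[0])
    attended = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, attended


# Hypothetical toy data: three regions with 2-D features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, attended = attend(regions, query)
```

In this toy run the first and third regions align with the query equally well, so they receive equal weight, and the second region receives less. The attended vector is the resulting weighted average of the region features.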

Similar resources

The effect of bottom-up and top-down auditory program training on the development of children's auditory processing skills

Although there have been several previous investigations on the role of auditory training for the development of auditory processing skills, it still remains unknown whether children with auditory processing difficulties can get improved auditory skills after exposure to a multi-modal training experience comprising both visual and tactile stimuli. The present study, therefore, attempted to use ...

Compressed-Sampling-Based Image Saliency Detection in the Wavelet Domain

When watching natural scenes, an overwhelming amount of information is delivered to the Human Visual System (HVS). The optic nerve is estimated to receive around 10^8 bits of information a second. This large amount of information cannot be processed right away by our neural system. The visual attention mechanism enables the HVS to spend neural resources efficiently, only on the selected parts of the...

The Effect of Bottom-up/Top-down Techniques on Lower vs. Upper-Intermediate EFL Learners’ Listening Comprehension

Listening is regarded as an interactive process involving decoding of information. This study was launched to find out the impact of bottom-up (BU) and top-down (TD) techniques on Iranian lower and upper intermediate learners’ listening comprehension. We selected a total of 120 participants in six intact classes, three lower intermediate and three upper intermediate. The proficiency level of th...

Visual Question Answering using Deep Learning

Multimodal learning between images and language has gained attention of researchers over the past few years. Using recent deep learning techniques, specifically end-to-end trainable artificial neural networks, performance in tasks like automatic image captioning, bidirectional sentence and image retrieval have been significantly improved. Recently, as a further exploration of present artificial...

Publication date: 2017
